Author: Haoyuan Huang
Class: cmps 3160
Date: 10/24/2021
This project is a comprehensive data analysis project on Customer Personality Analysis. The workflow of the project is divided into the following steps:
1. Introduction
2. Data: Extraction, Transform and Load
3. Exploratory Data Analysis
4. Model Construction and Evaluation
5. Conclusion & Suggestions
Since I am a graduate student, I will be working on this project alone.
The goal of the project is to explore the personalities for different customers, understand customer behavior and find a pattern to better target different customers.
A company has two methods to get more customers. In my view, both of them require the firm to understand a customer's personality.
First, I would like to explain my motivation for completing this project with this dataset. I am a master's student studying business analytics. I have rich experience in the business world, and I am passionate about understanding how a business firm can maximize its profits and why some firms make more money than others.
Second, as a business person, the most important thing is to take care of our customers. A truly successful business does not only chase short-term profit; it also maintains good relationships with its customers so they become long-term customers. I would love to analyze customer behavior and understand how we can target a customer based on his or her personal needs.
For this project, I will provide a data science analysis of customer personality and customer churn data. The project uses exploratory data analysis to find historical patterns, evaluate the causes of different purchasing behaviors, and help a business identify distinct customer groups so it can better target its customers and identify which factors drive a higher churn rate.
I have two datasets. The customer personality data will help me segment customers. For example, I plan to use a customer's average spend per product, total amount spent, and the time the customer has been an active user as features for a k-means clustering analysis. Since this dataset is unlabeled, I will create a label based on purchase frequency and then use features such as age, income, total number of purchases, and purchase channel to build a decision tree model that separates high-frequency from low-frequency customers.
I will use the second dataset to predict customer churn. For example, the data contains variables such as whether a customer is a senior citizen, the customer's age, tenure (how long a customer has stayed with the company), and more, so I can build a model to evaluate the positive or negative impact each variable has on the churn rate.
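The variable-impact idea above can be sketched with a logistic regression, whose coefficient signs indicate whether a feature pushes churn odds up or down. This is only a minimal illustration on hypothetical toy data (the feature values and labels below are invented), not the model built later in the project:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: columns = [SeniorCitizen (0/1), tenure in months]
X = np.array([[1, 2], [0, 50], [1, 5], [0, 60], [1, 3], [0, 40]])
y = np.array([1, 0, 1, 0, 1, 0])  # 1 = churned

model = LogisticRegression().fit(X, y)
# A positive coefficient means the feature increases the odds of churn
print(dict(zip(['SeniorCitizen', 'tenure'], model.coef_[0])))
```

In this toy setup the SeniorCitizen coefficient comes out positive and the tenure coefficient negative, matching the intuition that long-tenured customers churn less.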
After considering several potential datasets, I chose and obtained the data from Kaggle. Kaggle is an online community of data scientists and machine learning practitioners. Kaggle allows users to find and publish data sets, explore and build models in a web-based data-science environment, work with other data scientists and machine learning engineers, and enter competitions to solve data science challenges.
Customer Personality Dataset Information: The data offers the customer demographic information, products they bought in the past, and the place they bought their product, discount and promotion information on company's product.
Customer Personality Dataset Limitations: The dataset does not include different companies or brand names. That could be another interesting and important business problem to analyze; for example, we could study the factors that contribute to brand retention or improve brand reputation. For this particular dataset, however, I will focus on grouping customers and differentiating their behaviors. Another limitation is that the dataset is not up to date.
The final dataset can be downloaded here
Customer Churn Dataset Information: The data is labeled, so we know whether each customer churned or not. It contains customer demographic information, their daily behaviors, their account information with the company, their phone and internet service details, and more.
Customer Churn Dataset Limitation: The data is not linked to the customer personality data, which is a limitation for my analysis because I cannot merge the two datasets.
The final dataset can be downloaded here
General Information:
# Required packages
# Load Numpy
import numpy as np
# Load MatPlotLib
import matplotlib
import matplotlib.pyplot as plt
# Load Seaborn
import seaborn as sns
# Load Plotly
import plotly.express as px
import plotly.io as pio
# Load Pandas
import pandas as pd
# Plot Missing Values
import missingno as msno
# Stats
from scipy.stats import skew
import statsmodels.api as sm
import pylab as py
from scipy.stats import norm
# RE
import re
# This lets us show plots inline and also save PDF plots if we want them
%matplotlib inline
from matplotlib.backends.backend_pdf import PdfPages
matplotlib.style.use('seaborn')
# These two things are for Pandas, it widens the notebook and lets us display data easily.
from IPython.core.display import display, HTML
display(HTML("<style>.container { width:95% !important; }</style>"))
# Show a ludicrous number of rows and columns
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
# Import machine learning packages
from scipy import stats
from sklearn import preprocessing
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.model_selection import GridSearchCV
from sklearn.cluster import KMeans
# Import warnings package to ignore warnings
import warnings
warnings.filterwarnings('ignore')
This dataset was extracted from Here. The site provides clearly stated documentation, including a variable dictionary explaining each variable. In addition, it offers a preview of the dataset, so I can see its dimensions before downloading. The data is a CSV (comma-separated values) file, so I can load it directly into my notebook.
I will use pandas to load the data, along with the other packages needed for my analysis. To load the data, I will use pandas read_csv() with a tab separator and display the data before any cleaning.
df = pd.read_csv('marketing_campaign.csv', sep = '\t')
df.head()
Looking through the dataset, there are a couple of things I would like to examine:
# Check missing value
display(df.isna().sum())
display(df[df['Income'].isnull()])
There are 24 missing values in the Income column, so I will fill them in. I will use describe() to inspect the column's statistics and decide which fill method to use.
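Whether the mean or the median is the better fill value depends on skew: with a long right tail, the mean gets pulled well above the median. A toy illustration with hypothetical income numbers:

```python
import pandas as pd

# Hypothetical skewed incomes: one large outlier pulls the mean up
s = pd.Series([30000, 40000, 50000, 60000, 600000])
print("mean:", s.mean())      # inflated by the outlier
print("median:", s.median())  # robust to it
```

If the describe() output shows mean and median far apart, the median is usually the safer fill value for a skewed column.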
display(df['Income'].describe())
# Fill NAs with the column mean (vectorized equivalent of a row-by-row loop)
df['Income'] = df['Income'].fillna(df['Income'].mean())
display(df[df['Income'].isnull()])
Next, I will collapse the categories into a binary variable stored in a new column. I will use value_counts() to inspect each category and apply() to map the values into the binary variable.
display(df['Education'].unique())
display(df['Education'].value_counts())
df['Education_level'] = df['Education'].apply(lambda x: 'Undergraduate' if x == 'Basic' else 'Graduate')
df.head()
Similarly, the "Marital_Status" column holds categorical information of the same kind, so I will collapse it into a binary variable. I will create a new column with the simplified values.
display(df['Marital_Status'].unique())
display(df['Marital_Status'].value_counts())
df['Marital_Status_all'] = df['Marital_Status'].apply(lambda x: 'Relationship' if (x == 'Married' or x == 'Together') else "Single")
df.head()
For example, Kidhome is the number of children in the customer's household and Teenhome is the number of teenagers, so I will combine them into one total. Columns starting with "Mnt" record the amount spent on each product in the last 2 years, so I will combine them into a total spend. Columns starting with "Num...Purchases" record the number of purchases made through each channel, so I will combine them into a total purchase count. Columns starting with "Accepted" indicate whether the customer accepted the offer in each campaign, and "Response" is 1 if the customer accepted the offer in the last campaign, so I will combine them into a total number of accepted offers.
# Vectorized column sums (equivalent to iterating over rows)
df['Number_of_Kids'] = df['Kidhome'] + df['Teenhome']
df['Total_Product_Spent_per_month'] = df[['MntWines', 'MntFruits', 'MntMeatProducts',
                                          'MntFishProducts', 'MntSweetProducts', 'MntGoldProds']].sum(axis=1)
df['Total_Purchase_per_month'] = df[['NumDealsPurchases', 'NumWebPurchases',
                                     'NumCatalogPurchases', 'NumStorePurchases']].sum(axis=1)
df['Total_Accept_Offer'] = df[['AcceptedCmp1', 'AcceptedCmp2', 'AcceptedCmp3',
                               'AcceptedCmp4', 'AcceptedCmp5', 'Response']].sum(axis=1)
df.head()
It is also important to compute the average spend per product and the percentage of purchases made with a discount.
df['Average_Spend_per_product'] = round((df['Total_Product_Spent_per_month'] / df['Total_Purchase_per_month']), 1)
df['Deals_Purchases'] = round((df['NumDealsPurchases'] / df['Total_Purchase_per_month']) * 100, 1)
df.head()
Z_CostContact and Z_Revenue each contain only one value and are not needed for the project, so I will drop them. I will also drop ID, since it is just an identifier. Because I will derive the customer's age, I do not need to keep the birth year here, and I have already combined Kidhome and Teenhome, so I will drop those columns as well.
df.nunique()
df.drop(columns = ['ID', 'Kidhome', 'Teenhome', 'Z_CostContact', 'Z_Revenue'], axis =1, inplace = True)
df.head()
Next I will compute how long each customer has been with the company. The dataset covers customers from 2012 to 2015, so I will assume the reference date is 01-01-2016.
# create a new columns named duration
df['Dt_Customer'] = pd.to_datetime(df['Dt_Customer'])
# I assume today's date is 01-01-2016
df['today'] = '01-01-2016'
df['today'] = pd.to_datetime(df['today'])
# Vectorized date difference (equivalent to iterating over rows)
df['Days_since_client'] = df['today'] - df['Dt_Customer']
df.head()
# Remove the " days" suffix from the timedelta string
df['Days_since_client'] = df['Days_since_client'].astype('str').str.split(' ').str[0]
df.head()
Now we have a measure of customer engagement: the duration since a customer first registered an account with the company. However, it would be more useful to express it in years instead of days.
# Create a new column with the tenure in whole years
df['Days_since_client_year'] = df['Days_since_client'].astype('int') // 365
df.head()
# Drop unnecessary columns
df.drop(['today'], axis =1, inplace = True)
df.head()
It is also a good idea to know the customer's age; since Year_Birth contains the birth year, I can derive age from it.
df['Age'] = 2016 - df['Year_Birth']
df.drop(columns = 'Year_Birth', axis =1, inplace = True)
df.dropna(inplace = True)
df.isna().sum()
display(df.head())
display(df.dtypes)
# Change column Types
change_cols = [col for col in df.columns if col in ['Income', 'Number_of_Kids','Total_Product_Spent_per_month','Total_Purchase_per_month','Total_Accept_Offer','Age','Average_Spend_per_product','Deals_Purchases']]
for col in change_cols:
    df[col] = df[col].astype('int')
display(df.dtypes)
df['Days_since_client'] = df['Days_since_client'].astype('int')
# Reference: https://stackoverflow.com/questions/56486165/changing-data-types-of-multiple-columns-at-once-in-python-pandas
I am also interested in creating a column that indicates whether a customer shows high-frequency or low-frequency purchase behavior. I will use the average of Total_Purchase_per_month as the threshold.
# Calculate the average and use it as my threshold
threshold = df['Total_Purchase_per_month'].mean()
print("Average total purchases per month is:", threshold)
# Create the new variable (avoid hard-coding the threshold value)
df['High_freq'] = df['Total_Purchase_per_month'].apply(lambda x: 1 if x > threshold else 0)
print("Variable distribution is:", df['High_freq'].value_counts())
df.head()
This dataset was extracted from Here. The site provides clearly stated documentation, including a variable dictionary explaining each variable. In addition, it offers a preview of the dataset, so I can see its dimensions before downloading. The data is a CSV (comma-separated values) file, so I can load it directly into my notebook.
I will use pandas read_csv() to load the data; this file uses the default comma separator. I will display the data before any cleaning.
df_churn = pd.read_csv('WA_Fn-UseC_-Telco-Customer-Churn.csv')
df_churn.head()
Looking through the dataset, there are a couple of things I would like to examine:
I will use dtypes to check the variable data types. TotalCharges is currently an object, so I will use pd.to_numeric() to convert it to a numeric type.
display(df_churn.dtypes)
display(df_churn.head())
# Converting Total Charges to a numerical data type.
df_churn['TotalCharges'] = pd.to_numeric(df_churn['TotalCharges'], errors='coerce')
Next, I will use isna() to check for missing values in the data and replace them with the column's average value.
df_churn.isna().any()
# Fill missing TotalCharges with the column mean (vectorized equivalent of a loop)
df_churn['TotalCharges'] = df_churn['TotalCharges'].fillna(df_churn['TotalCharges'].mean())
df_churn.isna().any()
Next, I will examine the categorical variables and use a lambda function to map Yes/No values to 1/0. I will retain some columns as-is because I want to use them in the exploratory analysis.
# Convert all the Yes/No categorical variables into 1/0
yes_no_cols = ['Partner', 'PhoneService', 'OnlineSecurity', 'OnlineBackup',
               'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies',
               'PaperlessBilling', 'Churn', 'Dependents']
for col in yes_no_cols:
    df_churn[col] = df_churn[col].apply(lambda x: 1 if x == 'Yes' else 0)
df_churn['gender'] = df_churn['gender'].apply(lambda x: 1 if x == 'Male' else 0)
df_churn.head()
I will remove customerID since it is just an identifier and is not needed.
df_churn.drop(columns = 'customerID', axis =1, inplace = True)
There are duplicated values in the data. For example, MultipleLines has three unique values (No, No phone service, and Yes), but both No and No phone service indicate the customer has no multiple lines, so I will collapse the duplicated value.
df_churn['MultipleLines'] = df_churn['MultipleLines'].str.split(expand = True)[0]
df_churn['MultipleLines'].value_counts()
Next, I will check the data types again to make sure they are correct.
display(df_churn.dtypes)
display(df_churn.head())
Great! Everything looks good, and I am ready to move on to EDA.
However, I will do more data transformation in the later modeling section to transform some variables.
df_churn.head()
The goal of this section is to help me understand the data better and use statistical summaries to find patterns in the data. It is also an important step to take before building the machine learning models in a later section. The structure of this section is:
All graphs reference:
plt.figure(figsize=(10,7))
sns.set_theme(style="whitegrid")
plt.xlim(1,200000) # Set x-axis scale
ax = sns.boxplot(x=df["Income"],palette="Spectral")
plt.title('Customer Income Distribution', fontsize=15, weight = 'bold')# Add title
plt.show()
df['Education'] = df['Education'].replace("Graduation", 'Master')
# Create Education plot
plt.figure(figsize=(10,7)) # Reset the plot showing size
plt.subplot(2, 1, 1) # make two plots together
sns.boxplot(x="Education", y="Income",data=df, dodge=True,palette="Spectral")
plt.ylim(1,125000) # Set y-axis scale
plt.title('Education vs Income', fontsize=15, weight = 'bold')
plt.show()
# Create Marriage plot
plt.figure(figsize=(10,7))
plt.subplot(2, 1, 2)
sns.boxplot(x="Marital_Status", y="Income",data=df, dodge=True,palette="Spectral")
plt.ylim(1,125000)
plt.title('Marital Status vs Income', fontsize=15, weight = 'bold')
plt.show()
In the graph, my hypothesis that customers with a higher education background have higher income holds. The PhD class has a slightly higher median income than the other education classes, plus a couple of very high incomes (outliers). Basic has the lowest median income, as I expected. The 2n Cycle class means a customer returned to school after a couple of years of work, and their income is close to the Master class. Income is an important factor in a customer's purchasing behavior, so we will revisit it in the modeling section.
My initial hypothesis that marital status has a positive impact on income is not really true. People who got married and then divorced have slightly higher income than people who are together or married. People in the Absurd category have the highest income! This is an interesting finding, and there is no clear positive or negative correlation between marital status and income.
df1 = df.groupby(['Education_level','Marital_Status_all'])['Average_Spend_per_product'].mean().reset_index().sort_values('Average_Spend_per_product', ascending = False)
plt.figure(figsize=(10,7))
ax = sns.barplot(x = "Education_level", y = "Average_Spend_per_product", hue = "Marital_Status_all", data = df1,palette='Spectral')
plt.xticks()
for container in ax.containers:  # Add numbers on top of the bars
    ax.bar_label(container)
ax.set_title('Average Spend per Product by Education Level and Marital Status', weight='bold', size=15)
plt.show()
Average spend per product and income are positively correlated. As I expected, a higher education level (averaging more than $20 more per product) goes with a higher income level and higher average spend. Marital status is an interesting factor, though: people in a relationship with higher education spend less, while people in a relationship with lower education spend more. My educated guess is that people with higher education are more likely to be in a long-term relationship like marriage, so their average spend is lower, whereas people with only basic education are more likely to be in a short-term relationship or dating, so the costs are higher.
df2 = df[['Education_level','Marital_Status_all','Income','MntWines','MntFruits','MntMeatProducts','MntFishProducts','MntSweetProducts','MntGoldProds']]
id_vars = ["Education_level",
"Marital_Status_all",
"Income"
]
f_df = pd.melt(df2,
id_vars = id_vars,
var_name="Category",
value_name="freq")
f_df.head()
result = f_df.groupby('Category')['freq'].sum().reset_index(name = 'Total_amount_spent').sort_values(by = 'Total_amount_spent')
fig = px.pie(result, values = 'Total_amount_spent', names = 'Category', width = 600, height = 600, template='seaborn')
# Add text inside each section and update the line that separates them
fig.update_traces(textposition = 'inside',
textinfo = 'percent + label',
hole = 0.8,
marker = dict(line = dict(color = 'white', width = 4)),
hovertemplate = 'Clients: %{value}')
# Add plot title
fig.update_layout(annotations = [dict(text = 'Percentage Spent <br>by Category',
x = 0.5, y = 0.5, font_size = 28, showarrow = False,
font_family = 'monospace',
font_color = 'Black')],
showlegend = False)
fig.show()
One of the most essential factors for understanding customer personality is where customers spend their money. The graph validated my hypothesis: people spent 50% of their money on wine and 28% on meat. Customers spent the majority of their money on daily goods and groceries.
df3 = df[['Education_level','Marital_Status_all','Income','Number_of_Kids','NumDealsPurchases','NumWebPurchases','NumCatalogPurchases','NumStorePurchases']]
id_vars = ["Education_level",
"Marital_Status_all",
"Income",
"Number_of_Kids"
]
f_df = pd.melt(df3,
id_vars = id_vars,
var_name="Purchase_place",
value_name="Number_of_purchases")
f_df.head()
df4 = f_df.groupby(['Purchase_place','Number_of_Kids'])['Number_of_purchases'].sum().reset_index()
plt.figure(figsize=(10,7))
fig = sns.barplot(x = "Purchase_place", y = "Number_of_purchases", hue = "Number_of_Kids", data = df4,palette='Spectral')
plt.xticks(rotation=30)# Rotate the x-axis
fig.set_title('Distribution of Number of Purchases for Each Place Type and Number of Kids',weight='bold', size = 15)
plt.show()
From the graph, my hypothesis that customers with kids purchase more frequently holds, so purchase frequency and having kids are positively correlated. However, the correlation does not simply scale: increasing the number of kids does not directly increase the number of purchases a customer makes.
Catalogue purchases are the only channel where people without kids made more purchases than people with kids. In-store purchases are the most popular channel, with web purchases second. The data was collected before Covid, so I assume this has changed by now; comparing pre-Covid and post-Covid behavior would be a good topic for future analysis. Purchases made with a discount are slightly lower than the other channels for a couple of reasons. First, discounts are time-limited. Second, necessities are mandatory, so customers will buy them regardless of a discount.
# Create the scatter plot
ax = sns.scatterplot(
data=df, x="Income", y="Total_Product_Spent_per_month", hue="High_freq", size="High_freq",
sizes=(20, 200), legend="full", palette='Spectral'
)
ax.set_xlim(1,200000)
ax.set_title('Income vs Total Product Spent per Month by Purchase Frequency', weight='bold', size=15)
plt.show()
My hypothesis is that high-frequency purchasers have a higher income level and higher total product spend per month. In the graph, the green dots represent high-frequency customers, and their distribution sits slightly to the right of the orange dots (low-frequency customers), so their income level is higher as a group. Moreover, the majority of green dots sit in the upper section, meaning higher total product spend per month. Of course, there are outliers: a few people earn little but spend a lot, and a few earn a lot but spend very little.
# Melt data from long to wide
df4 = df[['Education_level','Marital_Status_all','Income','Age','Response','AcceptedCmp1','AcceptedCmp2','AcceptedCmp3','AcceptedCmp4','AcceptedCmp5']]
id_vars = ["Education_level",
"Marital_Status_all",
"Income",
"Age"
]
f_df = pd.melt(df4,
id_vars = id_vars,
var_name="Campaign_num",
value_name="Accepted_offer")
# Map to change variable names
f_df['Campaign_num'] = f_df['Campaign_num'].map({
'Response':'AcceptedCmp6',
'AcceptedCmp1': 'AcceptedCmp1',
'AcceptedCmp2':'AcceptedCmp2',
'AcceptedCmp3':'AcceptedCmp3',
'AcceptedCmp4':'AcceptedCmp4',
'AcceptedCmp5':'AcceptedCmp5'
})
f_df.head()
def age_buckets(x):
    if x < 30:
        return '18-29'
    elif x < 40:
        return '30-39'
    elif x < 50:
        return '40-49'
    elif x < 60:
        return '50-59'
    elif x < 70:
        return '60-69'
    elif x >= 70:
        return '70+'
    else:
        return 'other'
f_df['Age_range'] = f_df['Age'].apply(age_buckets)
df5 = f_df.groupby(['Campaign_num','Age_range'])['Accepted_offer'].sum().reset_index()
df5
plt.figure(figsize=(10,7))
fig = sns.barplot(x = "Campaign_num", y = "Accepted_offer", hue = "Age_range", data = df5,palette='Spectral')
fig.set_title('Distribution of Number of Accepted Offer for Each Campaign and Age Range',weight='bold', size = 15)
plt.show()
From the graph, at a high level there is a positive correlation between the campaign number and the number of accepted offers, so I will not reject my hypothesis. AcceptedCmp6 is 1 if the customer accepted the offer in the last campaign, and more customers accepted the last campaign than the first one. The second campaign has the lowest acceptance, and different campaigns perform differently, but there is an increasing trend overall. Another interesting factor is customer age. People aged 40 to 49 have the highest rate of accepting a campaign offer, while customers over 70 have the lowest. Older people have less interest in purchasing many products, so they are less likely to accept a marketing offer; customers between 30 and 50 have higher demand and want to save money, so their acceptance rate is higher.
# Create the lmplot
sns.lmplot(x='Total_Product_Spent_per_month', y='Days_since_client', fit_reg=True,
           scatter_kws={"color": "orange", "alpha": 0.5, "s": 30}, data=df)
plt.title('The Correlation between Total Spend per Month and Days since a Customer Joined', fontsize=15, weight='bold')
plt.xlabel('The total product spent per month',size=15)
plt.ylabel('Days since a customer',size=15)
plt.show()
There is no obvious relationship between total product spent per month and days since becoming a customer, so I will reject my hypothesis.
print('The correlation between the total product spend per month and Days since a customer is: '+ str(df['Days_since_client'].corr(df['Total_Product_Spent_per_month'])))
It turns out the correlation is positive but relatively weak.
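Beyond the point estimate, scipy's pearsonr also returns a p-value, which helps judge whether a weak correlation like this is distinguishable from zero. A sketch on hypothetical toy data standing in for the two real columns (the notebook's scipy imports already cover this):

```python
import pandas as pd
from scipy.stats import pearsonr

# Hypothetical stand-in for the two columns in df
toy = pd.DataFrame({
    'Days_since_client': [120, 250, 380, 410, 530, 640, 700, 815],
    'Total_Product_Spent_per_month': [55, 80, 60, 120, 90, 150, 130, 160],
})
r, p = pearsonr(toy['Days_since_client'], toy['Total_Product_Spent_per_month'])
print(f"r = {r:.3f}, p = {p:.4f}")
```

On the real data, the same call on df's columns would tell us whether the weak positive correlation is statistically significant.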
df['Income'] = np.where(df['Income'] > 120000, 120000, df['Income']) # Remove outlier
# Create the lmplot
sns.lmplot(x='Total_Product_Spent_per_month', y='Income', fit_reg=True,
           scatter_kws={"color": "orange", "alpha": 0.5, "s": 30}, data=df)
plt.title('The Correlation between Total Spend per Month and Income Level', fontsize=15, weight='bold')
plt.xlabel('The total product spent per month',size=15)
plt.ylabel('Income',size=15)
plt.show()
The graph shows some degree of correlation, so I want to compute the actual correlation coefficient.
print('The correlation between the total product spend per month and Income is: ' + str(df['Income'].corr(df['Total_Product_Spent_per_month'])))
It turns out that total product spent per month and income are positively correlated.
customer_churned = df_churn.loc[df_churn['Churn'] == 1]
customer_not_churned = df_churn.loc[df_churn['Churn'] == 0]
# Create density plot
plt.figure(figsize=(10,7))
ax = sns.kdeplot(customer_churned['MonthlyCharges'], shade=True, palette='Spectral')
ax = sns.kdeplot(customer_not_churned['MonthlyCharges'], ax=ax, palette='Spectral', shade=True)
ax.legend(["Churn", "Not Churn"], loc='upper right')  # Label the curves in plotting order
ax.set_ylabel('Density')
ax.set_xlabel('Monthly Charges')
ax.set_title('Distribution of monthly charges by churn',weight='bold', size = 15)
plt.show()
The graph validated my hypothesis that customer churn is heavily affected by monthly charges. The churned customers' density is concentrated at higher monthly charges, roughly in the 70 to 100 dollar range, while the non-churned customers have a large mass at low monthly charges, around 20 dollars.
In summary, the higher the amount charged to the customer monthly, the higher the chance the customer will churn; it is a positive relationship.
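One way to put a single number on the charge-churn relationship is the correlation between the 0/1 churn flag and monthly charges (a point-biserial correlation). A sketch on hypothetical toy data (the values below are invented; the real call would use df_churn's columns):

```python
import pandas as pd

# Hypothetical toy data: churn flag and monthly charges
toy = pd.DataFrame({
    'Churn':          [1, 1, 1, 0, 0, 0, 0, 1],
    'MonthlyCharges': [95, 80, 100, 25, 30, 60, 20, 90],
})
# .corr() between a 0/1 column and a numeric column gives the point-biserial correlation
print(toy['Churn'].corr(toy['MonthlyCharges']))
```

A positive value means churners tend to have higher charges; applied to df_churn, this quantifies what the density plot shows visually.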
# Create the kde plot for total charges
customer_churned = df_churn.loc[df_churn['Churn'] == 1]
customer_not_churned = df_churn.loc[df_churn['Churn'] == 0]
# Create density plot
plt.figure(figsize=(10,7))
ax = sns.kdeplot(customer_churned['TotalCharges'], shade=True, palette='Spectral')
ax = sns.kdeplot(customer_not_churned['TotalCharges'], ax=ax, palette='Spectral', shade=True)
ax.legend(["Churn", "Not Churn"], loc='upper right')  # Label the curves in plotting order
ax.set_ylabel('Density')
ax.set_xlabel('Total Charges')
ax.set_title('Distribution of Total charges by churn',weight='bold', size = 15)
plt.show()
The graph validated my hypothesis that customer churn is affected by total charges. Both distributions are right-skewed, and their patterns are similar to each other.
In summary, churn appears to be higher when total charges are lower.
churn = df_churn.groupby(['Contract', 'Churn']).size().reset_index(name = 'Number of Customers')
# set plot style: grey grid in the background:
sns.set(style="darkgrid")
# set the figure size
plt.figure(figsize=(10, 7))
fig = sns.barplot(x = "Contract", y = "Number of Customers", hue = "Churn", data = churn,palette='Spectral')
plt.xticks()
for container in fig.containers:  # Add numbers on top of the bars
    fig.bar_label(container)
fig.set_title('Distribution of Churn by Contract Type',weight='bold', size = 15)
plt.show()
From the graph, the churn rate is highest for month-to-month contracts and lowest for two-year contracts. There is a strong negative relationship between contract length and churn: the longer the contract term, the lower the churn rate. This supports my hypothesis and matches my expectation; customers are less likely to churn on a longer contract than on a shorter one.
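The bar chart shows raw counts; the churn rate per contract type can be computed directly with a groupby, since the mean of a 0/1 churn flag is the fraction churned. A minimal sketch on hypothetical toy data (column names follow the real dataset):

```python
import pandas as pd

# Hypothetical toy data mirroring the Contract/Churn columns
toy = pd.DataFrame({
    'Contract': ['Month-to-month', 'Month-to-month', 'Month-to-month',
                 'One year', 'One year', 'Two year', 'Two year'],
    'Churn':    [1, 1, 0, 1, 0, 0, 0],
})
rate = toy.groupby('Contract')['Churn'].mean()  # fraction churned per contract type
print(rate)
```

Applied to df_churn, the same groupby would turn the count comparison above into per-contract churn rates.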
df6 = df_churn.groupby(['PaymentMethod','Churn']).count()['gender'].reset_index(name = 'Number of Customers')
churned = df6.loc[df6.Churn == 1].sort_values(by = 'Number of Customers')
# Create the bar chart
plt.figure(figsize=(10,7))
fig = sns.barplot(x = "Churn", y = "Number of Customers", hue = "PaymentMethod", data = churned,palette='Spectral')
fig.set_title('Customer Churn Rate by Different Payment Types',weight='bold', size = 15)
plt.xticks()
for container in fig.containers:  # Add numbers on top of the bars
    fig.bar_label(container)
plt.show()
From the graph, churn is heavily affected by payment method. Among churned customers, automatic payments are associated with far fewer churns than non-automatic payments. This meets my expectation: a customer who sets up an automatic payment is more likely to keep paying for their subscription and other services and will not forget to pay their bills. Customers who pay by electronic check have a significantly higher churn rate. Thus, payment method is an important variable affecting churn, and it is important to include it in the model in a later section.
churn = (df_churn['Churn'] == 1).sum()
Not_churn = (df_churn['Churn'] == 0).sum()
# Set size
plt.figure(figsize=(10,7))
# Create pie chart
labels = ['Churn','Not Churn']
colors = ['coral','khaki']
explode = [0.1,0]
plt.pie([churn,Not_churn],labels = labels,autopct = '%.2f %%',pctdistance = 0.8,explode = explode, colors = colors)
plt.title('Customer Churn Rate Distribution',weight='bold', size = 15)
#draw circle
centre_circle = plt.Circle((0,0),0.70,fc='white')
fig = plt.gcf()
fig.gca().add_artist(centre_circle)
plt.show()
The graph validates my hypothesis. In the data, about 74% of the customers do not churn and only about 26% do, so the data is skewed and the label is imbalanced. Since far more customers stay than churn, it is important to deal with this imbalance before building the model: otherwise the model will tend to predict the majority class, producing more false negatives (Type II errors).
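One common way to address this imbalance is class weighting, which scikit-learn supports directly. This is a sketch on hypothetical toy data with roughly the same 74/26 split, not necessarily the approach used later in the notebook:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Toy data mimicking the ~74% / ~26% churn split
X_toy = rng.normal(size=(200, 3))
y_toy = (rng.random(200) < 0.26).astype(int)

# class_weight='balanced' reweights each class inversely to its
# frequency, discouraging the model from always predicting the majority
clf = LogisticRegression(class_weight='balanced').fit(X_toy, y_toy)
```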
import matplotlib.ticker as mtick  # For specifying the axes tick format

churn = df_churn.groupby(['SeniorCitizen', 'Churn']).size().unstack()
# Create the stacked bar plot
colors = ['#EEDC82', '#EE7942']
ax = (churn.T * 100.0 / churn.T.sum()).T.plot(kind='bar',
                                              width=0.2,
                                              stacked=True,
                                              rot=0,
                                              figsize=(10, 7),
                                              color=colors)
ax.yaxis.set_major_formatter(mtick.PercentFormatter())  # Format the y-axis as XX%
ax.legend(loc='center', prop={'size': 14}, title='Churn')  # Place the legend in the middle
ax.set_ylabel('% Customers')
ax.set_title('Churn by Seniority Level', size=15, weight='bold')
# Add the percentage labels on the stacked bar chart
for p in ax.patches:
    width, height = p.get_width(), p.get_height()
    x, y = p.get_xy()
    ax.annotate('{:.0f}%'.format(height), (x + .25 * width, y + .4 * height),
                color='white', weight='bold', size=14)
# Reference for the percentage labels: https://stackoverflow.com/questions/57453959/how-to-show-percentages-on-the-stacked-bar-chart-in-python-3-x
First, the SeniorCitizen variable simply indicates whether or not the customer is a senior citizen. The graph validates my hypothesis: senior citizens churn at almost double the rate of the younger population. I think this is a very interesting finding, and the variable could be an important factor influencing customer churn.
The goal of this section is to use machine learning models to predict customer behavior and help us target customers more efficiently and accurately. I am planning to build three models:
K-means clustering on the customer personality data, and logistic regression and random forest on the customer churn data.
Why the K-means model? I chose K-means for the customer personality data because the dataset is unlabeled and I would like to cluster customers together to see their common personality traits. K-means is less computationally expensive than other clustering techniques, requires less storage, and runs faster.
Why logistic regression and random forest? I chose LR and RF for the customer churn data because the dataset is labeled and I want to predict churn, a binary variable (0 or 1). LR and RF are well suited to classification, and both are straightforward and easy to implement.
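As a sketch of this modeling setup (on synthetic data; the real features come from the churn dataset later on), fitting both classifiers with scikit-learn looks like this:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the churn features and the binary churn label
X, y = make_classification(n_samples=500, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

# Fit both classifiers and compare held-out accuracy
lr_demo = LogisticRegression(max_iter=1000).fit(X_train, y_train)
rf_demo = RandomForestClassifier(random_state=42).fit(X_train, y_train)
print(lr_demo.score(X_test, y_test), rf_demo.score(X_test, y_test))
```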
In order to better target a customer, there are two general approaches:
K-means clustering can segment customers into different clusters. The model will identify the unique characteristics of each customer group, which answers my opening question: how can a company differentiate customers and tailor its marketing strategy to sell its products? Thus, the K-means model will help me understand how to better target a customer.
LR and RF can predict customer churn, so I will be able to understand which factors influence a customer to abandon a product and churn. Thus, these two models will help me understand how a company can keep its retention rate high and reduce churn.
K-means clustering: I am planning to use a customer's Education, Marital_Status, Income, Number_of_Kids, Total_Product_Spent_per_month, Total_Accept_Offer, Total_Purchase_per_month, Age, Average_Spend_per_product, and Deals_Purchases as my variables (they all come from my EDA) for the clustering analysis. Clustering is an unsupervised machine learning method, so I do not need a y-variable.
Logistic regression and random forest: both are supervised machine learning models. I will use churn as my dependent variable (the y-variable) and the rest of the variables as my independent variables. I plan to use GridSearchCV from scikit-learn to tune the model hyperparameters.
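A minimal GridSearchCV sketch on synthetic data (the parameter grid here is hypothetical; the real grid would be tuned to the churn features):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the churn features and label
X_demo, y_demo = make_classification(n_samples=300, n_features=6,
                                     random_state=1)

# Small hypothetical grid; GridSearchCV tries every combination with
# 3-fold cross-validation and keeps the best-scoring one
param_grid = {'n_estimators': [50, 100], 'max_depth': [3, None]}
search = GridSearchCV(RandomForestClassifier(random_state=1),
                      param_grid, cv=3)
search.fit(X_demo, y_demo)
print(search.best_params_)
```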
# Variable Selection
selected_cols = ['Education_level', 'Marital_Status_all','Number_of_Kids',
'Total_Product_Spent_per_month','Total_Purchase_per_month', 'Total_Accept_Offer','Deals_Purchases',
'Age','Income','Average_Spend_per_product']
personality_df = df[selected_cols]
personality_df.head()
# Correlation Matrix
fig, ax = plt.subplots(1, 1, figsize=(10, 10))
corr = personality_df.corr(method='pearson')
mask = np.zeros_like(corr, dtype=bool)  # np.bool is deprecated; use bool
mask[np.triu_indices_from(mask)] = True
colors = ["#45b5aa", "white", "#ff6f61"]
cmap = matplotlib.colors.LinearSegmentedColormap.from_list("", colors)
# Create the heatmap
sns.heatmap(corr,
            square=True,
            mask=mask,
            linewidth=2.5,
            vmax=0.4, vmin=0,
            cmap=cmap,
            cbar=False,
            ax=ax, annot=True)
ax.set_yticklabels(ax.get_yticklabels(), fontfamily='serif', rotation=0, fontsize=11)
ax.set_xticklabels(ax.get_xticklabels(), fontfamily='serif', rotation=90, fontsize=11)
ax.spines['top'].set_visible(True)
# Add figure text
fig.text(0.97, 1.02, 'Correlation Analysis', fontweight='bold', fontfamily='serif', fontsize=18, ha='right')
fig.text(0.97, 0.912, 'Dataset : Customer Personality', fontweight='light', fontfamily='serif', fontsize=14, ha='right')
fig.text(0.97, 0.91, '''
Find the correlations between the variables.
Before building the model, I want to
see whether some variables are highly correlated.''',
         fontweight='light', fontfamily='serif', fontsize=14, va='top', ha='right')
plt.tight_layout()
plt.show()
# Encode the categorical variables (select the non-numerical columns)
categorical_cols = []
for i in personality_df.columns:
    if personality_df[i].dtypes == 'object':
        categorical_cols.append(i)
print(categorical_cols)

label_encoder = LabelEncoder()
for i in categorical_cols:
    personality_df[i] = label_encoder.fit_transform(personality_df[i])
personality_df.head()
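One caveat worth noting: LabelEncoder assigns integers in alphabetical order of the category names, which imposes an arbitrary ordering that distance-based methods like K-means will treat as meaningful; one-hot encoding is a common alternative. A small illustration with hypothetical education levels:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical education levels to illustrate the mapping
edu = pd.Series(['PhD', 'Basic', 'Master', 'Basic'])

enc = LabelEncoder()
codes = enc.fit_transform(edu)  # categories sorted alphabetically
print(dict(zip(enc.classes_, range(len(enc.classes_)))))

# One-hot alternative that avoids an implied ordering
onehot = pd.get_dummies(edu)
```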
There are three popular ways to scale features: standardization (z-scores), min-max normalization, and robust scaling.
I will use standardization, so the result can contain negative values, and I decided to use scikit-learn's StandardScaler to do it.
scaled_features = StandardScaler().fit_transform(personality_df.values)
scaled_personality_df = pd.DataFrame(scaled_features, index=personality_df.index, columns=personality_df.columns)
scaled_personality_df[:5]
# Alternative: manual standardization with pandas
# personality_df_scaled = (personality_df.
#     sub(personality_df.mean(), axis=1).
#     divide(personality_df.std(), axis=1))
# personality_df_scaled.head()
# Apply the Elbow method
wcss = []
for i in range(1, 11):
    km = KMeans(n_clusters=i, random_state=228)
    km.fit(scaled_personality_df)
    wcss.append(km.inertia_)

# Set up the figure
plt.figure(figsize=(12, 8))
plt.title('The Elbow Method', size=25, y=1.03, fontname='monospace')
plt.grid(color='gray', linestyle=':', axis='y', alpha=0.8, zorder=0, dashes=(1, 7))
# Create the plot
sns.lineplot(x=range(1, 11), y=wcss, color='#336b87', linewidth=3)
sns.scatterplot(x=range(1, 11), y=wcss, color='#336b87', s=60, edgecolor='black', zorder=5)
plt.ylabel('WCSS', size=14, fontname='monospace')
plt.xlabel('Number of clusters', size=14, fontname='monospace')
plt.xticks(size=12, fontname='monospace')
plt.yticks(size=12, fontname='monospace')
# Annotate the candidate elbows
plt.annotate('''4 could be an optimal number
of clusters''', xy=(4.05, 12000), xytext=(5.8, 16000),
             arrowprops=dict(arrowstyle='->', color='black'),
             fontsize=13, fontfamily='monospace', ha='center', color='black')
plt.annotate('''3 could be an optimal number
of clusters''', xy=(3, 13000), xytext=(1.8, 10000),
             arrowprops=dict(arrowstyle='->', color='black'),
             fontsize=13, fontfamily='monospace', ha='center', color='black')
# Mark the candidate k values
plt.axvline(3, 1, 0, color='green')
plt.axvline(4, 1, 0)
plt.show()
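Since the elbow plot leaves both 3 and 4 as plausible choices, a complementary check (not part of the elbow analysis itself) is the silhouette score, which rewards tight, well-separated clusters. A sketch on toy blob data standing in for `scaled_personality_df`:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Toy data with four well-separated blobs, standing in for the
# scaled personality features
X_blob, _ = make_blobs(n_samples=300,
                       centers=[[0, 0], [5, 5], [0, 5], [5, 0]],
                       random_state=228)

# A higher silhouette score means tighter, better-separated clusters
sil = {}
for k in (3, 4):
    labels = KMeans(n_clusters=k, random_state=228,
                    n_init=10).fit_predict(X_blob)
    sil[k] = silhouette_score(X_blob, labels)
print(sil)
```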